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Abstract 

There has been a lot of work fitting Ising models to multivariate binary data in 
order to understand the conditional dependency relationships between the variables. 
However, additional covariates are frequently recorded together with the binary data, 
and may influence the dependence relationships. Motivated by such a dataset on 
genomic instability collected from tumor samples of several types, we propose a sparse 
covariate dependent Ising model to study both the conditional dependency within 
the binary data and its relationship with the additional covariates. This results in 
subject-specific Ising models, where the subject's covariates influence the strength of 
association between the genes. As in all exploratory data analysis, interpretability of 
results is important, and we use l\ penalties to induce sparsity in the fitted graphs and 
in the number of selected covariates. Two algorithms to fit the model are proposed 
and compared on a set of simulated data, and asymptotic results are established. The 
results on the tumor dataset and their biological significance are discussed in detail. 

Key Words: Graphical model, Lasso, Ising model, Binary Markov network, covari- 
ates. 



1 Introduction 



Markov networks have been applied in a wide range of scientific and engineering problems to 
infer the local conditional dependency of the variables. Examples include gene association 



studies (Peng et al. 2009; Wang et al. , 2011), image processing (Hassner h Sklansky, 1980 



Woods, 1978), and natural language processing (Manning & Schutze, 1999). A pairwise 



Markov network can be represented by an undirected graph G = (V,E), where V is the 
node set representing the collection of random variables, and E is the edge set where the 
existence of an edge is equivalent to the conditional dependency between the corresponding 
pair of variables, given the rest of the graph. 

Previous studies have focused on the case where an i.i.d. sample is drawn from an un- 
derlying Markov network, and the goal is to recover the graph structure, i.e., the edge set 
E, from the data. Two types of graphical models have been studied extensively: the mul- 



tivariate Gaussian model for continuous data, and the Ising model (Ising, 1925) for binary 
data. In the multivariate Gaussian case, the graph structure E is completely specified by the 
off-diagonal elements of the inverse covariance matrix, also known as the precision matrix. 
Therefore, estimating the edge set E is equivalent to identifying the non-zero off-diagonal 
entries of the precision matrix. Many papers on estimating the inverse covariance matrix 
have appeared in recent years, with a focus on the high-dimensional framework, for example, 



Meinshausen & Buhlmann (2006); Yuan & Lin (2007); Rothman et al. (2008); d'Aspremont 



et al. (2008); Rocha et al. (2008); Ravikumar et al. (2008); Lam & Fan (2009); Peng et al. 



(2009 


); 


Yuan 


(2010 


); 


Cai et al. 


(2011b) 



methods, and many establish asymptotic properties such as consistency and sparsistency. 
Many have also proposed fast computational algorithms, the most popular of which is per- 



haps glasso by Friedman et al. (2008), which was recently improved further by Witten et al. 



(2011) and Mazumder & Hastie (2012). 
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In the Ising model, the network structure can be identified from the coefficients of the in- 
teraction terms in the probability mass function. The problem is, however, considerably more 
difficult due to the intractable normalizing constant, which makes the penalized likelihood 



methods popular for the Gaussian case extremely computationally demanding. Ravikumar 



et al. (2010) proposed an approach in the spirit of Meinshausen & Buhlmann (2006 )'s work 
for the Gaussian case, fitting separate ^i-penalized logistic regressions for each node to in- 
fer the graph structure. A pseudo-likelihood based algorithm was developed by |Hofling EE 



Tibshirani (2009) and analyzed by Guo et al. (2010c). 



The existing literature mostly assumes that the data are an i.i.d. sample from one un- 
derlying graphical model, although the case of data sampled from several related graphical 



models on the same nodes has been studied both for the Gaussian and binary cases Guo 



eFal~1 ( |2010bpi| ). However, in many real-life situations, the structure of the network may 



further depend on other extraneous factors available to us in the form of explanatory vari- 
ables or covariates, which result in subject-specific graphical models. For example, in genetic 
studies, deletion of tumor suppressor genes plays a crucial role in tumor initiation and devel- 
opment. Since genes function through complicated regulatory relationships, it is of interest 
to characterize the associations among various deletion events in tumor samples. However, 
in practice we observe not only the deletion events, but also various clinical phenotypes for 
each subject, such as tumor category, mutation status, and so on. These additional fac- 
tors may influence the regulatory relationships, and thus should be included in the model. 
Motivated by situations like this, here we propose a model for the conditional distribution 
of binary network data given covariates, which naturally incorporates covariate information 
into the Ising model, allowing the strength of the connection to depend on the covariates. 
With high- dimensional data in mind, we impose sparsity in the model, both in the network 
structure and in covariate effects. This allows us to select important covariates that have 
influence on the network structure. 



There have been a few recent papers on graphical models that incorporate covariates, but 



they do so in ways quite different from ours. Yin & Li (2011 ) and Cai et al. (2011a) proposed 



to use conditional Gaussian graphical models to fit the eQTL (gene expression quantitative 
loci) data, but only the mean is modeled as a function of covariates, and the network remains 



fixed across different subjects. Liu et al. (2010) proposed a graph- valued regression, which 



partitions the covariate space and fits separate Gaussian graphical models for each region 
using glasso. This model does result in different networks for different subjects, but lacks 
interpretation of the relationship between covariates and the graphical model. Further, there 
is a concern about stability, since the so built graphical models for nearby regions of the 
covariates are not necessarily similar. In our model, covariates are incorporated directly into 
the conditional Ising model, which leads to straightforward interpretation and "continuity" 
of the graphs as a function of the covariates, since in our model it is the strength of the edges 
rather than the edges themselves that change from subject to subject. 

The rest of the paper is organized as follows. In Section |2j we describe the conditional 
Ising model with covariates, and two estimation procedures for fitting it. Section[3]establishes 
asymptotic properties of the proposed estimation method. We evaluate the performance of 
our method on simulated data in Section |4| and apply it to a dataset on genomic instability 
in breast cancer samples in Section |5j Section [6] concludes with a summary and discussion. 

2 Conditional Ising model with covariates 
2.1 Model set-up 

We start from a brief review of the Ising model, originally proposed in statistical physics by 



Ising (1925). Let y = (yi, . . . , y q ) e {0, l} q denote a binary random vector. The Ising model 



specifies the probability mass function Pe(y) as 



p e{y) = ^y ex P (^2 e m + ^ e mvkj 



where 6 = (6>n, 6>i 2 , . . . , 9 q -i q , 9 qq ) is a q(q + l)/2-dimensional parameter vector and Z{6) is 
the partition function ensuring the 2 q probabilities summing up to 1. Note that from now 
on we assume 9j k equals to 9 k j unless otherwise specified. The Markov property is related 
to the parameter 6 via 

9 jk = yj 1 y k || y\ m , Vj + k, (1) 

i.e., yj and y k are independent given all other y's if and only if 9jk = 0. 

Now suppose we have additional covariate information, and the data are a sample of n 
i.i.d. points V n = {(x 1 , y 1 ), . . . , (x n , y n )} with x l £ W and y % G {0, l} 9 . We assume that 
given covariates x, the binary response y follows the Ising distribution given by 

q 



P ( y \ x ^ = Z(0(x)) 6XP I ^ e 3i( x )Vj + Yl e A x )yjVk I • ( 2 ) 

We note that for any covariates x l , the conditional Ising model is fully specified by the 
vector 0(x i ) = (0 n (x i ),0i 2 (x i ), . . . , ©^(a;*), M (aj*)), and by setting 6 kj (x) = 6 jk (x) for 
all j > k, the functions 6j k (x) can be connected to conditional log-odds in the following way, 

^( i-X;=^) ) =9 " w+ g/ itWfe (3) 

where, j/u = (y%, . . . , J/j+i, ■ ■ ■ ,y q )- Further, conditioning on J/w^fc} being 0, we also 
have 

/ Pfa- = l,y fc = 1| y Uj - fc} ,a;)P(y 3 - = 0,y fc = 0| = ^ 

= 1.2/k = °l y\{j,k^ x ) p {v3 = °'Vk = !| y\{j,k}i x )j lk 

Similarly to ([T|, this implies yj and are conditionally independent given covariates x and 
all other y's if and only if Oj k (x) = 0. 

A natural way to model 0j k (x) is to parametrize it as a linear function of x. Specifically, 
for 1 < j < k < q, we let 

6 jk (x) = 9 jk0 + 6j k x, where 0j k = (9 jkl , . . . , 9 jkp ) 
9 jk (x) = k j{x), Vj > k 
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The model can be expressed in terms of the parameter vector 6 = (6*110, @n, #120, #12 > ■ ■ ■ ■> ^Wb 0q q ) 
as follows: 

PMX) = Z(6{x)) 6XP (j^ 0ji ° + ° TjjX)Vj + ^ djk ° + ^ kX)m ") ' (4) 
Instead of (pj, we now have the log-odds that depend on the covariates, through 

l0g f 1 -P(^=!|^l^ = ^° + + D ( ^' fc0 + (5) 



The choice of linear parametrization for Oj^ix) has several advantages. First, (|5]) mirrors 
the logistic regression model when viewing the a^'s, y^'s and x^y^s (k 7^ j) as predictors. 
Thus the model has the same interpretation as the logistic regression model, where each 
parameter describes the size of the conditional contribution of that particular predictor. 
Second, this parametrization has a straightforward relationship to the Markov network. 
One can tell which edges exist and on which covariates they depend by simply looking at 
6. Specifically, the vector (9jko, 6jk) being zero implies that y^ and yj are conditionally in- 
dependent given any x and the rest of y^s, and 6jke being zero implies that the conditional 
association between yj and y^ does not depend on xg. Third, the continuity of linear func- 
tions ensures the similarity among the conditional models for similar covariates, which is a 
desirable property. Finally, the linear formulation promises the convexity of the negative 
log-likelihood function, allowing efficient algorithms for fitting the model discussed next. 

2.2 Fitting the model 

The probability model Pg(y\x) in Q includes the partition function Z(0(x)), which requires 
summation of 2 q terms for each data point and makes it intractable to directly maximize 

n 

the joint conditional likelihood log P e (y l \x l ). However, (|5| suggests we can use logistic 

i=l 



regression to estimate the parameters, an approach in the spirit of Ravikumar et al. (2010). 
The idea is essentially to maximize the conditional log-likelihood of yj- given y\. and x % 
rather than the joint log-likelihood of y % . 



Specifically, the negative conditional log-likelihood for yj can be written as follows 

1 A. . ,, . . 1 ' 



= --E^W^) = -„E ( 1 °g( l + e ' ?j ) -m) . w 



1=1 i=l 



where 







Note that this conditional log-likelihood involves the parameter vector only through its 
subvector 6j = (6jio,Oji, ■ ■ ■ ^jgOi^fq) £ 1R^ P+1 ' ) ' 3 , thus we sometimes write £j(6j]V n ) when 
the rest of 6 is not relevant. 

There are (p + l)q(q + l)/2 parameters to be estimated, so even for moderate p and q 
the dimension of 6 can be large. For example, with p = 10 and q = 10, the model has 605 
parameters. Thus there is a need to regularize 6. Empirical studies of networks as well as 
the need for interpretation suggest that a good estimate of should be sparse. Thus we 
adopt the l\ regularization to encourage sparsity, and propose two approaches to maximize 
the conditional likelihood (j6]). 

Separate regularized logistic regressions 

The first approach is to estimate each dj, j = 1, . . . , q separately using the following criterion, 

min £j{0j; £>„,) + A||0a o ||i, 

J eR(p+ 1 )9 

where 0j\o = Qj\{@jjo}, that is, we do not penalize the intercept term OjjQ. 

In this approach, 6jk and 6kj are estimated from the jth and kth regressions, respectively, 
thus the symmetry = O^j is not guaranteed. To enforce the symmetry in the final 



estimate, we post-process the estimates following Meinshausen & Biihlmann (2006), where 
the initial estimates are combined by comparing their magnitudes. Specifically, let 9jkt 
denote the final estimate and 9® ki denote the initial estimate from the separate regularized 
logistic regressions. Then for any 1 < j < k < q and any I = 0, . . . ,p, we can use one of the 
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two symmetrizing approaches: 



separate-max: 9 jkt = 9 kjl = 8 jk A\§° j>i# j) + 



separate-min: iW = fc # = ^I(|e°j<|^|) + ^(|^ M |>|^|) 

The separate-min approach is always more conservative than separate-max in the sense that 
the former provides more zero estimates. It turns out that when the sample size is small, the 
separate-min approach is often too conservative to effectively identify non-zero parameters. 
More details are given in Section |4j 

Joint regularized logistic regression 

The second approach is to estimate the entire vector 6 simultaneously instead of estimating 
the Ofs separately, using the criterion, 

g 

min yY-(0;D n ) + A||0 vo ||i, 

j = l 

where 0\ = 6\{0u , 9 2 2o, • • • , 9 qq o}. The joint approach criterion can be written as one large 
penalized logistic regression by careful rearranging of terms. One obvious benefit of the joint 
approach is that 6 can be automatically symmetrized by treating 0j k and k j as the same 
during estimation. The price, however, is that it is computationally much less efficient than 
the separate approach. 

To fit the model using either the separate or the joint approach, we adopt the coordinate 



shooting algorithm in Fu (1998), where we update one parameter at a time and iterate 



until convergence. The implementation is similar to the glmnet algorithm of Friedman et al. 



(2010), and we omit the details here. 



3 Asymptotics: consistency of model selection 

In this section we present the model selection consistency property for the separate regular- 
ized logistic regression. Results for the joint approach can be derived in the same fashion 
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by treating the joint regression as a single large logistic regression. The spirit of the proof 



is similar to Ravikumar et al. (2010), but since their model does not include covariates cc, 
both our assumptions and conclusions are different. 

In this analysis, we treat the covariates CEj's as random vectors. With a slight change of 
notation, we now use 6j to denote Oj\o, dropping the intercept which is irrelevant for model 
selection. The true parameter is denoted by 6* . Without loss of generality we assume that 
9*j Q = 0, and we also assume that 9jjo = 0. 

First, we introduce additional notation to be used throughout this section. Let 

I* = E e *(V 2 \ogP e (y 3 \x,y Xj )) (7) 
= Kg* (pj(l — Pj){x <8> V\j)( x ® V\j) T ) (Information matrix) (8) 
U* = Ee*((x®y^)(x^y^ T ) (9) 

where 

Pj = Pj( x >V\j) = p o*(yj = l \ x ,V\j) > 
x®y\j = (l,xi,...,x p ) T ®(y 1 ,...,y j - 1 ,l,y j+1 ,...,y q ) T \{l}. 

Let Sj denote the index set of the non-zero elements of 6*, and let J5.5. be the submatrix 
of I* indexed by «Sj. Similarly defined are IstSj an d Is?s?> where Sj is the compliment set 
of Sj. Moreover, for any matrix A, let ||>l||oo = maxj J^ . \Aij\ be the matrix L ra norm, and 
let A min (74) and A max (y4) be the minimum and maximum eigenvalues of A, respectively. 

For our main results to hold, we make the following two assumptions for all q logistic 
regressions. 

Al There exists a constant a 6 (0, 1], such that 

11^5, (n&Y'w- <(i-«). 



A2 There exist constants A min > and A max > 0, such that 



Amin ylsjSjJ — ^min 
■ il maxV'- y j J — "max 

These assumptions bound the correlation among the effective covariates, and the amount of 
dependence between the group of effective covariates and the rest. Under these assumptions, 
we have the following result: 

Theorem 1 For any j — 1, . . . , q, let Oj be a solution of the problem 

min -^(Bj-Vn) + X n \\8j\\i. (10) 
Assume Al and A2 hold for I* and U*, and further assume that for some 5 > 

^(Nloo > M) < exp(-M' 5 ), for all M > M > 0, (11) 
Let d = maxj ||«Sj||o and C > a constant independent of (n,p, q). If 

M n > (C\ 2 n n)Th , (12) 

x n > c^^m , (is) 

n > CM*d 3 (\ogp + \ogq) , (14) 

the following hold with probability at least 1 — exp~ c( - x ™ n ^ (5* is a constant in (0, 1)), 

1. Uniqueness: Oj is the unique optimal solution for any j G {1, . . . , q}. 

2. i 2 consistency: \\0j — 6*\\ 2 < 5X n Vd/ A min for any j G {1, . . . , q} 

3. Sign consistency: Oj correctly identifies all the zeros in 0* for any j G {l,...,q}; 
moreover, Oj identifies the correct sign of non-zeros in 0* whose absolute value is at 
least lOX n Vd/ A min . 
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Theorem [l] establishes the consistency of model selection allowing both of the dimensions 
p(n) and q(n) to grow to infinity with n. The extra condition, which requires the distribution 



of x to have a fast decay on large values, was not in Ravikumar et al. (2010) as the paper 
does not consider covariates. The new condition is, however, quite general; for example, 
it is satisfied by the Gaussian distribution and all categorical covariates. The proof of the 
theorem can be found in the Appendix. 

4 Empirical performance evaluation 

In this section, we present three sets of simulation studies designed to test the model selection 
performance of our methods. We vary different aspects of the model, including sparsity, 
signal strength and proportion of relevant covariates. The results are presented in the form 
of ROC curves, where the rate of estimated true non-zero parameters (sensitivity) is plotted 
against the rate of estimated false non-zero parameters (1-specificity) across a fine grid of 
the regularization parameter. Each curve is smoothed over 20 replications. 

The data generation scheme is as follows. For each simulation, we fix the dimension of 
the covariates p, the dimension of the response q, the sample size n and a graph structure E 



in the form of a q x q adjacency matrix (randomly generated scale-free networks (Barabasi 



& Albert, 1999). For any (j,k), 1 < j < k < q, {0jko } O- k ) consists of (p + 1) independently 
generated and selected from three possible values: (3 > (with probability p/2), —(3 (with 
probability p/2), and (with probability 1 — p). An exception is made for the intercept 
terms 9jjo, where p is always set to 1. Covariates x l, s are generated independently from the 
multivariate Gaussian distribution N p (0, I p ). Given each x l and 6, we use Gibbs sampling to 
generate the y l , where we iteratively generate a sequence of y^'s (j — 1, . . . q) from a Bernoulli 
distribution with probability Pe{y) = x *) an d take the last value of the sequence when 

a stopping criterion is satisfied. 

We compared three estimation methods: the separate-min method, the separate-max 
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method and the joint method. Our simulation results indicate that performance of the 
separate-min method is substantially inferior to that of the separate-max method in almost 
all cases (results omitted for lack of space). Thus we only present results for the separate-max 
and the joint methods in this section. 

4.1 Effect of sparsity 

First, we investigate how the selection performance is affected by the sparsity of the true 
model. The sparsity of can be controlled by two factors: the number of edges in E, 
denoted by n E , and the average proportion of effective covariates for each edge, p. We fix 
the dimensions q = 10, p = 20 and the sample size n = 200, and set the signal size to /3 = 4. 
Under this setting, the total number of parameters is 1155. The sparsity parameter n E takes 
values in the set {10, 20, 30}, and p takes values in {0.2, 0.5, 0.8}. 

The resulting ROC curves are shown in Figure [TJ The first row shows the results of the 
joint approach and the second row of the separate-max approach. As the true model becomes 
less sparse, the performance of both the joint and the separate methods deteriorates, since 
sparse models have the smallest effective number of parameters to estimate and benefit the 
most from penalization. Note that the model selection performance seems to depend on the 
total number of non-zero parameters ((q+riE)(p+l)p), not just on the number of edges (%). 
For example, both approaches perform better in case % = 20, p = 0.2 than n E = 10, p = 0.5, 
even though the former has a more complicated network structure. Comparing the separate- 
max method and the joint method, we observe that the two methods are quite comparable, 
with the joint method being slightly less sensitive to increasing the number of edges. 

Note that the "*" point on each curve represents the average sensitivity and (1-specificity) 
over the replications based on an "optimal" A, selected by maximizing the conditional log- 
likelihood on an independent validation dataset of the same size as the training data. 
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Figure 1: ROC curves for varying levels of sparsity, as measured by the number of edges (n^) and 
expected proportion of non-zero covariates (p). The star on each curve corresponds to an optimal 
value of A selected on an independent validation set. 

4.2 Effect of signal size 

Second, we assess the effect of signal size. The dimensions are set to be the same as in 
the previous simulation, that is, q = 10, p = 20 and n = 200, and underlying network is 
the same. The expected proportion of effective covariates for each edge is p = 0.5. The 
signal strength parameter (3 takes values in the set {0.5, 1,2,4,8, 16}. For each setting, the 
non-zero entries of the parameter vectors 6 are at the same positions with the same signs, 
only differing in magnitude. The resulting ROC curves are shown in Figure [2] 

As the signal strength increases, both the separate and the joint methods show improved 
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Joint Approach Separate-Max Approach 




False Positive Rate False Positive Rate 

Figure 2: ROC curves for varying levels of signal strength, as measured by the parameter /3. The 
star on each curve corresponds to an optimal value of A selected on an independent validation set. 

selection performance, but the improvement levels off eventually. Both methods achieve 
almost the same "optimal" sensitivity and specificity (the V point), with the separate-max 
method performing better overall. 

4.3 Effect of noise covariates 

In the last set of simulations, we study how the model selection performance is affected 
by adding extra uninformative covariates. At the same time, we also investigate the effect 
of the number of relevant covariates Pt rU e an d the sample size n. The dimension of the 
response is fixed to be q = 10 and the network structure remains the same as in the previous 
simulation. We take Ptrue e {10>20} and n G {200,500}. For each combination, we first 
fit the model on the original data and then on augmented data with extra uninformative 
covariates added. The total number of covariates P^otal € ij°true' 50, 200}. The non-zero 
parameters are generated the same way as before with (3 = 4 and p = 0.5. With the changes 
in Ptotal' ^ e total number of non-zero parameters remains fixed for each value of Ptruc 
while the total number of zeros is increasing. 
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To make the results more comparable across setting, we plot the counts rather than rates 
of true positives and false positives. The resulting curves are shown in Figure [3] Generally, 
performance improves when the sample size grows and deteriorates when the number of 
noise covariates increases, particularly with a smaller sample size. The separate- max method 
dominates the joint method under these settings, but the difference is not large. 



Joint Approach, P, rue =10 



Separate-Max Approach, P, rue =10 
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Figure 3: ROC curves for varying dimension, number of noise covariates, and sample size. 
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5 Application to tumor suppressor genes study 

In breast cancer, deletion of tumor suppressor genes plays a crucial role in tumor initiation 
and development. Since genes function through complicated regulatory relationships, it is 
of interest to characterize the associations among various deletion events in tumor samples, 
and at the same time to investigate how these association patterns may vary across different 
tumor subtypes or stages. 

Our data set includes DNA copy number profiles from cDNA microarray experiments on 



143 breast cancer specimens (Bergamaschi et al. 2006). Among them, 88 samples are from 



a cohort of Norwegian patients with locally advanced (T3/T4 and/or N2) breast cancer, 
receiving doxorubicin (Doxo) or 5 fluorouracil/mitomycin C (FUMI) neoadjuvant therapy 



(Geisler et al. , 2003). The samples were collected before the therapy. The other 55 are from 



another cohort of Norwegian patients from a population-based series (Zhao et al. , 2004). 
Each copy number profile reports the DNA amounts of 39,632 probes in the sample. The 
array data was preprocessed and copy number gain/loss events were inferred as described 



in 



Bergamaschi et al. ( 2006[ ). To reduce the spatial correlation in the data, we bin the 



probes by cytogenetic bands (cytobands). For each sample, we define the deletion status of 
a cytoband to be 1 if at least three probes in this cytoband show copy number loss. 430 
cytobands covered by these probes show deletion frequencies greater than 10% in this group 
of patients, and they were retained for the subsequent analysis. The average deletion rate 
for all the 430 cytobands in 143 samples is 19.59%. Our goal is to uncover the association 
among these cytoband-deletion events and how the association patterns may change with 
different clinical characteristics, including TP53 mutation status (a binary variable), estrogen 
receptors (ER) status (a binary variable), and tumor stage (an ordinal variable taking values 
in {1,2,3,4}). 

For our analysis, denote the array data by 2/143x430, where indicates the deletion status 
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of the j th cytoband in the i th sample. Let x l denote the covariate vector containing the three 
clinical phenotypes of the i th sample, and X\ the Ith covariate vector. We first standardize 
the covariate matrix #143x3 and then fit our Ising model with covariates with the separate- 



max fitting method. We then apply stability selection (Meinshausen & Biihlmann, 2010) 
to infer the stable set of important covariates for each pairwise conditional association. 
Specifically, we repeatedly fit the model 100 times on subsamples containing half the data 
selected randomly without replacement. For each tuning parameter A from a fixed grid of 
values, we record the frequency of 9jki being non-zero respectively for each covariate Xi, 
I = 0,1,2,3 on all pairs of (j, k), 1 < j < k < 430, and denote it by fjkiW- Note that 
Xq corresponds to the main effect interaction between a pair of y^s and does not involve 
any covariates. Then we use f* kl = ma.x\ fjki{^) as a measure of importance of covariate X\ 
for the edge (j, k). Finally, for each covariate Xj, we rank the edges based on the selection 
frequencies {f* kl : 1 < j < k < q}. At the top of the list are the edges that depend on 
Xj most heavily. We are primarily interested in the pairs of genes belonging to different 
chromosomes, as the interaction between genes located on the same chromosome is more 
likely explained by strong local dependency. The results are shown in Table 1, where the 
rank list of the edges depending on different covariates are recorded. The first two columns 
of each covariate related columns are the node names and the third columns record the 
selection frequency. 

There are 332 inter-chromosome interactions (between cytobands from different chromo- 
somes) with selection probabilities at least 0.5. Among these, 39 interactions change with 
the TP53 status; 12 change with the ER status; and another 12 change with the tumor grade 
(see details in Table [T]). These results can be used by biologists to generate hypotheses and 
design relevant experiments to better understand the molecular mechanism of breast cancer. 

The most frequently selected pairwise conditional association is between deletion on 
cytoband 4q31.3 and deletion on 18q23 (94% selection frequency). Cytoband 4q31.3 harbors 

16 



the tumor suppressor candidate gene SCFFbw7, which works cooperatively with gene TP53 



to restrain cyclin E-associated genome instability (Minella et al. , 2007). Previous studies 



also support the existence of putative tumor suppressor loci at cytoband 18q23 distal to 



the known tumor suppressor genes SMAD4, SMAD2 and DCC (Huang et al., 1995 Lassus 



et al. 2001). Thus the association between the deletion events on these two cytobands is 
intriguing. 

Another interesting finding is that the association between deletion on cytoband 9q22.3 
region and cytoband 12pl3.31 appears to be stronger in the TP53 positive group than in the 
TP53 negative group. A variety of chromosomal aberrations at 9p22.3 have been found in 



different malignancies including breast cancer (Mitelman et al. 1997). This region contains 
several putative tumor suppressor genes (TSG), including DNA-damage repair genes like 
FANCC and XPA. Alterations in these TSGs have been reported to be associated with 



poor patient survival (Sinha et al. 2008[ ). On the other hand, cytoband 12pl3.31 harbors 



another TSG, namely ING4 (inhibitor of growth family member 4), whose protein binds 
TP53 and contributes to the TP53-dependent regulatory pathway. A recent study also 
suggests involvement of ING4 deletion in the pathogenesis of HER2-positive breast cancer. 
In light of these previous findings, it is interesting that our analysis also found the association 
between the deletion events of 9p22.3 and 12pl3.31, as well as the changing pattern of the 
association under different TP53 status. This result suggests potential cooperative roles for 
multiple tumor suppressor genes in cancer initiation and progression. 

We also searched the network for hubs (highly connected nodes), which often have impor- 
tant roles in genetic regulatory pathways. Since there can be different hubs associated with 
different covariates, we separate them as follows. For each node j, covariate I, and stability 
selection subsample m, let the "covariate-specific" degree of node j be = #{k : 9^ ^ 0}. 
A ranking of nodes can then be produced for each covariate I and each replication m, with 
rj] being the corresponding rank. Finally, we compute the median rank across all stability 
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Table 1: Frequency-based ranked list of covariate-dependent inter-chromosomal interactions 



Main effect 


TP53 mutation status 


ER status 


Genel 


Gene2 


Freq 


Genel 


Gene2 


Freq 


Genel 


Gene2 


Freq 


4q31.3 


18q23 





95 


3p22.2 


22ql3.1 


0.79 


3q26.1 


llpl4.3 





69 


2p25.2 


15q26.2 





87 


3pl2.3 


12pl3.1 


0.72 


4q34.3 


5q32 





64 


2q36.3 


3p26.1 





84 


12q22 


15ql4 


0.7 


8pll.22 


llpl4.2 





63 


7q21.13 


8q21.13 





84 


2pl2 


Xp22.33 


0.69 


3q24 


22qll.23 





57 


6p21.32 


16ql2.2 





83 


6p21.32 


8pll.22 


0.68 


4pl4 


llpl5.3 





55 


3p21.1 


17pl3.2 





81 


lp34.2 


3p24.1 


0.67 


lq31.1 


Xq27.3 





54 


4q24 


12q21.1 





81 


2p21 


Xpll.22 


0.67 


13q33.2 


22qll.23 





54 


2q23.3 


6pl2.1 





79 


2pl2 


7p21.1 


0.66 


21q21.1 


22qll.21 





54 


8p21.3 


21q21.1 





79 


12ql5 


13ql2.12 


0.63 


5q33.1 


17q21.31 





53 


2q34 


3ql3.31 





78 


4q25 


8pll.22 


0.62 


12q21.32 


18q22.3 





51 


6p21.32 


9q31.3 





78 


8pll.22 


Xq23 


0.62 


8pll.22 


22qll.21 





5 


6p21.32 


13q21.1 





78 


9p21.2 


16q22.1 


0.61 


8q21.13 


Xp22.11 





5 


6p21.31 


llpl5.2 





78 


3p21.1 


llql4.1 


0.58 










llpl5.1 


14q22.2 





78 


3pl3 


9p24.2 


0.58 










lp36.11 


2p21 





77 


9q22.32 


12pl3.31 


0.57 










lp31.1 


2q32.2 





76 


7q21.3 


22ql2.3 


0.56 


Tumor stage 


lq31.1 


22qll.21 





76 


3q26.1 


llpl3 


0.55 


Genel 


Gene2 


Freq 


2q32.1 


6ql4.1 





76 


4q35.2 


22ql2.3 


0.55 


16q23.3 


17pl3.1 





61 


9q21.11 


16q21 





76 


15q22.33 


17pll.2 


0.55 


12pll.23 


16ql2.2 





59 


9q31.3 


14q24.3 





76 


3p22.1 


6p21.31 


0.54 


3ql3.13 


Xq23 





57 


10q25.3 


12pl3.31 





76 


4q28.2 


7q21.13 


0.54 


7p21.3 


12pll.23 





56 


4q35.1 


15q22.2 





75 


5ql3.1 


6q22.33 


0.54 


9q34.13 


15q21.1 





55 


3p21.31 


17pll.2 





74 


5q23.2 


8p21.2 


0.54 


llq24.2 


13q32.3 





55 


6p21.32 


13q31.2 





74 


16q22.1 


17q21.31 


0.54 


8q21.13 


13q33.1 





54 


10qll.21 


12pl3.32 





74 


4q28.3 


9p21.3 


0.53 


2p21 


12pl3.31 





53 


9q33.1 


14ql2 





73 


4q35.1 


9p21.3 


0.53 


10q26.3 


17pll.2 





53 


12pl3.31 


17qll.2 





73 


4q35.2 


16q22.1 


0.53 


7p21.3 


12pl2.1 





51 


lp34.2 


3p22.1 





72 


2q31.3 


4ql3.2 


0.52 


3ql3.13 


7p21.3 





5 


5q33.1 


llpl5.4 





72 


3p26.1 


14ql3.1 


0.52 


9q34.13 


15q22.1 





5 


6ql2 


20pl2.1 





72 


4pl6.1 


13q31.1 


0.52 










12pl2.2 


Xpll.4 





72 


6p21.31 


llql4.2 


0.52 










4q35.2 


9p21.2 





71 


3p25.1 


Ilpl5.2 


0.51 










llpl5.2 


18ql2.1 





71 


5ql4.2 


Xq27.1 


0.51 










lp21.1 


7q21.12 





7 


5ql4.2 


Xq27.2 


0.51 










2pl6.1 


6pl2.3 





7 


8pll.22 


15ql4 


0.51 










2q31.2 


3p26.2 





7 


10q23.32 


21q21.1 


0.51 










2q36.3 


9q22.31 





7 


16q22.1 


17pl3.2 


0.51 










3p22.1 


15q25.3 





7 


3p22.1 


5q33.3 


0.5 










6p21.32 


Xpll.4 





7 


5ql4.2 


17q21.2 


0.5 
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selection subsamples = median^" 1 ;, m 



, 100}, and order nodes by rank for each 



covariate. The results are listed in Table[2j Interestingly, cytoband 8pll.22 was ranked close 
to the top for all three covariates. The 8pll-pl2 genomic region plays an important roles 
in breast cancer, as numerous studies have identified this region as the location of multiple 



oncogenes and tumor suppressor genes (Yang et al. 2006 Adelaide et al. 1998). High fre- 



quency of loss of heterozygosity (LOH) of this region in breast cancer has also been reported 



(Adelaide et al. 1998). Particularly, cytoband 8pll.22 harbors the candidate tumor sup- 



pressor gene TACC1 (transforming, acidic coiled-coil containing protein 1), whose alteration 



is believed to disturb important regulations and participate in breast carcinogenesis (Conte 



eFaLj |2~0~0~2~1 ). From Table [XJ we can also see that the deletion of cytoband 8pll.22 region is 
associated with the deletion of cytoband 6p21.32 and llpl4.2 with relatively high confidence 
(selection frequency > 0.6); and these associations change with both TP53 status and ER 
status. This finding is interesting because high frequency LOH at 6q and lip in breast cancer 
cells are among the earliest findings that led to the discovery of recessive tumor suppressor 



genes of breast cancer (Ali et al. , 1987 Devilce et al. 1991 Negrini et al. 1994). Moreover 



there is evidence that allele loss of c-Ha-ras locus at llpl4 correlates with paucity of oestro- 



gen receptor protein, as well as patient survival (MacKay et al. , 1988 Garcia et al. 1989). 



These results together with the associations we detected confirm the likely cooperative roles 
of multiple tumor suppressor genes involved in breast cancer. 

6 Summary and Discussion 

We have proposed a novel Ising graphical model which allows us to incorporate extraneous 
factors into the graphical model in the form of covariates. Including covariates into the model 
allows for subject-specific graphical models, where the strength of association between nodes 
varies smoothly with the values of covariates. One consequence of this is that if all covariates 
are continuous, there is probability of the graph structure changing with covariates, and 
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Table 2: Degree-based ranking of nodes 



Main effect 


TP53 mutation status 


ER status 


Tumor stage 


Gene 


Median rank 


Gene 


Median rank 


Gene 


Median rank 


Gene 


Median rank 


lp36.11 


16.75 


8pll.22 


12.75 


3q26.1 


10 


16q23.1 


19.25 


lq31.1 


21 


lp31.3 


14.5 


lq31.1 


12 


10qll.23 


22.25 


6p21.31 


24.25 


3p22.2 


25.25 


3p22.2 


13 


16ql2.2 


23.5 


6p21.32 


37 


lq31.1 


28.75 


8q21.13 


14 


9q34.13 


27.5 


2pl2 


38.5 


12q23.1 


32 


10q22.1 


15.25 


22qll.23 


27.75 


2q32.2 


43 


2pl6.2 


33.5 


8pll.22 


19 


12pll.23 


33 


8q21 13 


44.5 


4q31.1 


41.75 


3p21.1 


20.25 


2q33.1 


35.25 


6pl2.3 


45.5 


9p21.3 


42 


llq23.3 


22 


8pll.22 


35.75 


2q32.3 


53.75 


7q21.3 


44.25 


5ql3.1 


28 


10q25.2 


36 


3p22.2 


54.25 


3q26.1 


44.75 


4pl6.1 


33 


llql4.1 


40.5 


6pl2.1 


57.5 


12ql5 


45.5 


5ql3.3 


34 


10pl2.2 


41.5 


1 P 31.3 


59.25 


12pll.22 


51.5 


9p22.3 


36.25 


3ql3.13 


42 


21q21.1 


60 


15q22.1 


51.5 


8p21.3 


41.25 


13ql3.2 


42.75 


3q26.1 


73.25 


15q23 


51.75 


3p25.1 


42.5 


16ql2.1 


47 


12pll.22 


73.25 


8q21.13 


54 


10q23.2 


42.75 


6p21.31 


50 


6q26 


74.5 


9p21.2 


54.5 


5q32 


47 


llq22.2 


53 


13q32.1 


75.75 


21q21.1 


55.25 


lp36.11 


47.5 


10q26.3 


53.5 


17pl3.2 


78 


9q34.13 


59 


Xp22.22 


48.75 


9q33.1 


55.5 


llq!4.1 


80.25 


9p24.2 


62 


21q21.1 


49 


4q21.1 


56 



only the strength of the links is affected. With binary covariates, which is the case in our 
motivating application, this situation does not arise, but in principle this could be seen as 
a limitation. On the other hand, this is a necessary consequence of continuity, and small 
changes in the covariates resulting in large changes in the graph, as can happen with the 



approach of Liu et al. (2010), make the model interpretation difficult. Further, our approach 
has the additional advantage of discovering exactly which covariates affect which edges, 
which can be more important in terms of scientific insight. 

While here we focused on binary network data, the idea can be easily extended to cate- 
gorical and Gaussian data, and to mixed graphical models involving both discrete and con- 
tinuous data. Another direction of interest is understanding conditions under which methods 
based on the neighborhood selection principle of running separate regressions are preferable 
to pseudo-likelihood type methods, and vice versa. This comparison arises frequently in the 
literature, and understanding this general principle would have applications far beyond our 
particular method. 
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Appendix: Proof of Theorem [l] 

For notational convenience, we omit the j indexing each separate regression. Following the 
literature, we prove the main theorem in two steps: first, we prove the result holds when 
assumptions Al and A2 hold for I n and U n , the sample versions of of /* and U* defined 
in (J7|) (Proposition [TJ. Then we show that if Al and A2 hold for the population versions 
I* and U*, they also hold for I n and U n with high probability (Proposition [2]). The sample 
quantities I n and U n are defined as 



1 



i=l 



u 



n 
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Proposition 1 If Al and A2 are satisfied by I n and U n , assume moreover that 

M n = supllajHoo < oo a.s., 



A > 8M n (2-a) /logp + logg 



a v n 

Hi 



n > Cd (logp + logg) . 

Then with probability at least 1 — 2exp ^— C^^J, the result of Theorem^ holds. 

Proof of Proposition [7J The proof requires several steps. The uniqueness part follows 
directly from the following lemma: 



Lemma 1 (Shared sparsity and uniqueness of 0, Ravikumar et al. (2010)). Define the sign 

vector t for 6 to satisfy the following properties, 

i k = sign(6 k ), if 6 k ^ , 
|4|<1, ifd k = 0- 

Suppose there exists an optimal solution 8 with sign t defined as above, such that, llt^clloo < 

1, then any optimal solution 6 must have 6 s c = 0. Furthermore, if the Hessian matrix 

V 2 £(6)ss is strictly positive definite, then is the unique solution. 



We now proceed to prove the rest of Proposition |1| For 6 to be a solution of (10), the 
sub-gradient at 6 must be 0, i.e., 

V£(0,V n ) + \ n t = . (15) 
Then we can write VI \0, X>„) - VI (0*, V n ) = -\ n i + W n , where 

W n = _ W (0*,P n ) = ® y{ 3 ){y) -p)(0*)) ■ 

i=l 

Let 6 denote a point in the line segment connecting 6 and 0* . Applying the mean value 
theorem gives 

r (e - = w n - x n t + R n . (16) 
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where R n = (v 2 £{0*,V n ) - V 2 £{0,V n )^j (0 - 0*). 

Now define as follows: let S be the index set of true non-zeros in 0* , let 0$ be the 
solution of 

min^ig + Ajdslli , (17) 

(05,0) 

and let =0. We will show that this is the optimal solution and is sign consistent with 
high probability. 

We set the corresponding sign vector is for 0$ similarly defined as in Lemma [lj and i s c = 



— V s c£(0s, T> n ) as obtained in (15). Now we need to show that with high probability, 



It, Hoc < 1, for jeS c (U 

10X n Vd 



tj = sign{0*), for j e S and \\9*\\ > A " (19) 



'nun 



The following three lemmas form the proof. 



Lemma 2 (Control the remainder term W n ). For a e (0, 1], assume ||a?||oo < M n a.s, then, 

.'2 — a.. TirTll , ^ a\ ^ A ( Xlna 2 . . 

1 I -j^-ll^ll- £ jj £4exp {- 32M ? (2 _ a y +lo e p + lo eq 



This probability goes to as long as \ n > 8M^ W logp+logg 



Proof of Lemma 2. We can write W n = ^ YH=i{ x% ® y\j){y) ~ Pj(0*)) — J2i=t %h where 
is bounded by M n /n. Thus by Azuma-Hoeflding Inequality, 

p (V n IU > tt^t! < 2 M p(W|U> A " n 



4(2 -a); " ^ V 4 (2-« 



" 4eXpl ' 32M/(2-^ +lQgP + l0g9) - 



□ 



A 2 \ 

Lemma 3 (^-consistency of the sub-vector 0s). If \ n d < 10A min M 7 HW^Hoo < 4 ; 

^min 
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Proof ofLemmaQ Let G{u s ) = £(0 s + u s ,V n ) - £{0* s ,V n ) + X n {\\0* s + u s \\i - \\0* s \\i) be a 
function G : R d -»■ M. It is easy to see that G(us) is convex and it achieves its minimum at 
us = 0s — 0*s- Moreover, G(0) = 0. Thus if we can show that G(us) is positive on the set 
I \us || 2 = -B, then we will have us < B due to convexity of G{us)- Note that 

G(tt 5 ) = -Wfu s + u T s VH{0% + a« s )« s + X n {\\0% + u s \\ l - 110*110 . 

Further, 

\Wfu s \ < ||W"||ooKI|i < ^dWush , 
A mi „(V 2 £(6l* + au s )) > A miQ - A max M n Vd\\ush , 
|A n (||^ + W5||i-||05lli)l < KVd\\u s h ■ 
Combining all of the above, we have 

G(u s ) > \\u s \\ 2 (-A max M n Vd\\u s \\ 2 2 + A min ||ii 5 || 2 - -X n Vd) . 
Easy algebra shows that if X n d < , nA A ° lin M and B = the result follows. □ 

A 2 \ 

Lemma 4 (Control the remainder term R n ). If X n d < 100M m ^ — Halloo < 4 , then 

< —r^ M n X n d < 



An " A^ n n n "4(2 -a) • 
Proof of Lemma [^} Recall that 

iT = (vH(0\v n )-vH(0,v n )) (0-0*) 

Let ujj(0) = Pj(0)(l — p l A0)). The k-th. element of R n has the form 

1 n 

l - jr u)(0)zi {0* - 0) T (x i ® y^.)^ ® (« - 



i=i 
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where Z\ = x\y l m} for some (l,m). By Al and Lemma pi we have 



\R±\ — M n A max \\0 — 6*\\\ < M n A max 



5X n \/d 

m i r 



□ 



Putting all the lemmas together, we are ready to prove Proposition [TJ 
Proof of Proposition^ Set A n = 8M "^ ^M±^m_ By Lemma Q we have WW 11 ]]^ < 

< *f with probability at least l-4exp(CX 2 n n/M 2 ). Choosing n > 10 ° 2 ^H^"^ ' d 2 (\ogp+ 

*• ' min 



logg)), we have X n d < 



100 Af„ "max 2 Q; 



— , thus the conditions of Lemmas 



^andQ 



hold. 



By rewriting (16) and utilizing the fact that 6 s c = 6* sC = 0, we have 



!sc s (6 s - 6* s ) 



Iss(0s - 0s) 



Wgc - \ n t s c + R n sc 



- X n t s + . 



(20) 
(21) 



Since I$ s is invertible by assumption, combining (20) and (21) gives 



i n s c s (r ss rm - Kt s + R n s ) = W^c - X n t s c + R n sc 



(22) 



To show (18), we reorganize (22) and use results from Lemmas 2 and 4 



An . t 



nll^cS 1 -" I loo 



IsCsVssrm - Ms + Rs) - W-a ~ R n s 



C \\oo 



< ll/ n cccf/'Jc)- 1 IU(l| W^IL + A n + ||i? n |D + IIW^IL + \\R r ' 



< XJ1 



S C S\ J -SSJ \\oo\ 

a , 



To show (19), it suffices to show that \\ds — #s||oo < -™. By Lemma 



\Os-o: 



< 



5X n \/d ^ 81 



'S " ".SHOO _ A 



< 



The last inequality follows as long as fl ^ in > 1 ^" v ^ . This completes the proof of Proposition 



D 



□ 
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Proposition 2 If I* and U* satisfy Al and A2, and M n = sup||£c||oo < oo a.s., the 
following hold for any 5 > 0. A and B are some positive constants. 

5 2 n 



P |A max ^ J2(x { ® yij) (x 1 ® y l v ) T j > D XI 



5} < 2exp —A 



Mid 2 



B(\ogp + hgq) 



P {Amin(Iss) < Cmin - 6) < 2 eX P —A 



5 2 n 
Mid 2 



+ B log 
£?(logp + logg) 



We omit the proof of Proposition [2j which is very similar to Lemmas 5 and 6 in Ravikumar 



et al. (2010). 



Proof of Theorem [1} With Propositions [T] and [2| the proof of Theorem [T] is straightforward. 



Given that Al and A2 are satisfied by /* and U* and that conditions (JX3[) and ( 14 ) hold, on 



the set A = {x : M n = sup ||cc|| < oo} the assumptions in Proposition [2] are satisfied. Thus 
with probability at least 1 — exp(— ^p^ ), the conditions of Proposition jlj hold, and therefore 
the results in Theorem [T] hold. Finally, let T stand for the set where the results of Theorem 
[TJhold. Then by (11) and (12), we have 



P{T C ) < P(T C | A)+P{A C ) < exp( 



Ml 



)+exp(-M^) < exp-(C'\ 2 n n) d , where < 5* < 1. 



□ 
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